Here is an outline of the steps we'll follow:
In this project, we will use the 'Used Cars Dataset' Austin Reese, available on Kaggle scraped from the classified advertisements website Craigslist.
The dataset contains one csv file named vehicles.csv with a size of 1.45 GB.
Before we dive into our exercise, Let's look at the columns we are going to analyze.
The dataset contains a total of 26 columns, listed below are the 17 columns are used to our analysis.
region: Region from where the listing is made.price:Asking price for the vehicle in the listing.year:Year of registration of the vehicle listed.manufacturer: Make of the vehicle listed.model: Model name of the vehicle listed.condition: Condition of the vehicle listed.cylinders:Engine size, based on the number of cylinders it has.odometer: The number of miles on the odometer of the vehicle.title_status: Contains the title status of the vehicle. Vehicle titles are certificates for legal ownership of a vehicle.transmission:The type of transmission on the vehicle.drive: Contains information about how the drive train delivers its power eg. AWD, FWD, RWD etc.size: Which size category the vehicle falls in.type Separates the vehicles on the basis of their type, eg. Hatchback, Pickup, Sedan etc.lat: Latitude of from where the listing is made.long: Longitude of from where the listing is made.posting_date: Date of when the listing was made.state: State code of where the listing is made.In this section, the chosen dataset is downloaded from Kaggle. This is done using the opendatasets library. On obtaining the dataset we read it using pandas and study the thus obtained pandas dataframe. Some of the important columns, that we plan to use in our analysis, are selected whereas the others are dropped so that the execution times are reduced.
pip install pyyaml==5.4.1
Requirement already satisfied: pyyaml==5.4.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (5.4.1) Note: you may need to restart the kernel to use updated packages.
[notice] A new release of pip available: 22.2.2 -> 22.3.1 [notice] To update, run: python.exe -m pip install --upgrade pip
pip install numpy pandas-profiling matplotlib seaborn folium opendatasets geopandas --upgrade
Requirement already satisfied: numpy in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (1.23.5) Requirement already satisfied: pandas-profiling in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (3.5.0) Requirement already satisfied: matplotlib in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (3.6.2) Requirement already satisfied: seaborn in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (0.12.1) Requirement already satisfied: folium in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (0.13.0) Requirement already satisfied: opendatasets in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (0.1.22) Requirement already satisfied: geopandas in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (0.12.1) Requirement already satisfied: pydantic<1.11,>=1.8.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (1.10.2) Requirement already satisfied: jinja2<3.2,>=2.11.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (3.1.2) Requirement already satisfied: PyYAML<6.1,>=5.0.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (5.4.1) Requirement already satisfied: multimethod<1.10,>=1.4 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (1.9) Requirement already satisfied: typeguard<2.14,>=2.13.2 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (2.13.3) Requirement already satisfied: scipy<1.10,>=1.4.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (1.9.3) Requirement already satisfied: htmlmin==0.1.12 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (0.1.12) Requirement already satisfied: tqdm<4.65,>=4.48.2 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (4.64.1) Requirement already satisfied: requests<2.29,>=2.24.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (2.28.1) Requirement already satisfied: statsmodels<0.14,>=0.13.2 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (0.13.2) Requirement already satisfied: phik<0.13,>=0.11.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (0.12.2) Requirement already satisfied: pandas!=1.4.0,<1.6,>1.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (1.5.0) Requirement already satisfied: visions[type_image_path]==0.7.5 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (0.7.5) Requirement already satisfied: networkx>=2.4 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from visions[type_image_path]==0.7.5->pandas-profiling) (2.8.7) Requirement already satisfied: tangled-up-in-unicode>=0.0.4 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from visions[type_image_path]==0.7.5->pandas-profiling) (0.2.0) Requirement already satisfied: attrs>=19.3.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from visions[type_image_path]==0.7.5->pandas-profiling) (21.4.0) Requirement already satisfied: Pillow in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from visions[type_image_path]==0.7.5->pandas-profiling) (9.2.0) Requirement already satisfied: imagehash in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from visions[type_image_path]==0.7.5->pandas-profiling) (4.3.1) Requirement already satisfied: pyparsing>=2.2.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from matplotlib) (3.0.9) Requirement already satisfied: python-dateutil>=2.7 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from matplotlib) (2.8.2) Requirement already satisfied: fonttools>=4.22.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from matplotlib) (4.37.3) Requirement already satisfied: contourpy>=1.0.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from matplotlib) (1.0.5) Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from matplotlib) (1.4.4) Requirement already satisfied: packaging>=20.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from matplotlib) (21.3) Requirement already satisfied: cycler>=0.10 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from matplotlib) (0.11.0) Requirement already satisfied: branca>=0.3.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from folium) (0.5.0) Requirement already satisfied: click in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from opendatasets) (8.1.3) Requirement already satisfied: kaggle in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from opendatasets) (1.5.12) Requirement already satisfied: pyproj>=2.6.1.post1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from geopandas) (3.4.0) Requirement already satisfied: shapely>=1.7 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from geopandas) (1.8.5.post1) Requirement already satisfied: fiona>=1.8 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from geopandas) (1.8.22) Requirement already satisfied: certifi in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from fiona>=1.8->geopandas) (2022.9.24) Requirement already satisfied: munch in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from fiona>=1.8->geopandas) (2.5.0) Requirement already satisfied: six>=1.7 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from fiona>=1.8->geopandas) (1.16.0) Requirement already satisfied: setuptools in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from fiona>=1.8->geopandas) (63.2.0) Requirement already satisfied: cligj>=0.5 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from fiona>=1.8->geopandas) (0.7.2) Requirement already satisfied: click-plugins>=1.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from fiona>=1.8->geopandas) (1.1.1) Requirement already satisfied: colorama in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from click->opendatasets) (0.4.5) Requirement already satisfied: MarkupSafe>=2.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from jinja2<3.2,>=2.11.1->pandas-profiling) (2.1.1) Requirement already satisfied: pytz>=2020.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas!=1.4.0,<1.6,>1.1->pandas-profiling) (2022.2.1) Requirement already satisfied: joblib>=0.14.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from phik<0.13,>=0.11.1->pandas-profiling) (1.2.0) Requirement already satisfied: typing-extensions>=4.1.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pydantic<1.11,>=1.8.1->pandas-profiling) (4.4.0) Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from requests<2.29,>=2.24.0->pandas-profiling) (1.26.12) Requirement already satisfied: charset-normalizer<3,>=2 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from requests<2.29,>=2.24.0->pandas-profiling) (2.1.1) Requirement already satisfied: idna<4,>=2.5 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from requests<2.29,>=2.24.0->pandas-profiling) (3.4) Requirement already satisfied: patsy>=0.5.2 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from statsmodels<0.14,>=0.13.2->pandas-profiling) (0.5.3) Requirement already satisfied: python-slugify in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from kaggle->opendatasets) (6.1.2) Requirement already satisfied: PyWavelets in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from imagehash->visions[type_image_path]==0.7.5->pandas-profiling) (1.4.1) Requirement already satisfied: text-unidecode>=1.3 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from python-slugify->kaggle->opendatasets) (1.3) Note: you may need to restart the kernel to use updated packages.
[notice] A new release of pip available: 22.2.2 -> 22.3.1 [notice] To update, run: python.exe -m pip install --upgrade pip
pip install plotly==5.3.1
^C Note: you may need to restart the kernel to use updated packages.
#import libraries
import plotly.express as px
import opendatasets as od
import pandas as pd
#First Go To Kraggle and Signin and create new api and keep the api.json in the project folder
#Download the dataset from kaggle by using link
used_cars_url="https://www.kaggle.com/austinreese/craigslist-carstrucks-data"
#downloads the dataset associated with the link provided
od.download(used_cars_url)
Skipping, found downloaded files in ".\craigslist-carstrucks-data" (use force=True to force download)
vehicles_path="craigslist-carstrucks-data/vehicles.csv"
#reads the csv file as pandas dataframe
vehicles_df=pd.read_csv(vehicles_path)
#To check the csv
vehicles_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 426880 entries, 0 to 426879 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 426880 non-null int64 1 url 426880 non-null object 2 region 426880 non-null object 3 region_url 426880 non-null object 4 price 426880 non-null int64 5 year 425675 non-null float64 6 manufacturer 409234 non-null object 7 model 421603 non-null object 8 condition 252776 non-null object 9 cylinders 249202 non-null object 10 fuel 423867 non-null object 11 odometer 422480 non-null float64 12 title_status 418638 non-null object 13 transmission 424324 non-null object 14 VIN 265838 non-null object 15 drive 296313 non-null object 16 size 120519 non-null object 17 type 334022 non-null object 18 paint_color 296677 non-null object 19 image_url 426812 non-null object 20 description 426810 non-null object 21 county 0 non-null float64 22 state 426880 non-null object 23 lat 420331 non-null float64 24 long 420331 non-null float64 25 posting_date 426812 non-null object dtypes: float64(5), int64(2), object(19) memory usage: 84.7+ MB
Handling missing data,
Handling redundant data,
Handling inconsistent data,
Remove Unnecessary data.
#list of columns which can be useful for our analysis..
selected_cols=['region','price','year','manufacturer','model','condition',
'cylinders','fuel','odometer','title_status','transmission'
,'drive','size','type','lat','long','posting_date','state']
#With Selected Columns we create another dataframe ..
df_selected =pd.read_csv(vehicles_path, usecols=selected_cols)
df_selected.shape
(426880, 18)
df_selected.duplicated().sum() #checking how many duplicates in total
37
df_selected.drop_duplicates() #droping duplicate from the data frame
| region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | drive | size | type | state | lat | long | posting_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | prescott | 6000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | az | NaN | NaN | NaN |
| 1 | fayetteville | 11900 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ar | NaN | NaN | NaN |
| 2 | florida keys | 21000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | fl | NaN | NaN | NaN |
| 3 | worcester / central MA | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ma | NaN | NaN | NaN |
| 4 | greensboro | 4900 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | nc | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 426875 | wyoming | 23590 | 2019.0 | nissan | maxima s sedan 4d | good | 6 cylinders | gas | 32226.0 | clean | other | fwd | NaN | sedan | wy | 33.786500 | -84.445400 | 2021-04-04T03:21:31-0600 |
| 426876 | wyoming | 30590 | 2020.0 | volvo | s60 t5 momentum sedan 4d | good | NaN | gas | 12029.0 | clean | other | fwd | NaN | sedan | wy | 33.786500 | -84.445400 | 2021-04-04T03:21:29-0600 |
| 426877 | wyoming | 34990 | 2020.0 | cadillac | xt4 sport suv 4d | good | NaN | diesel | 4174.0 | clean | other | NaN | NaN | hatchback | wy | 33.779214 | -84.411811 | 2021-04-04T03:21:17-0600 |
| 426878 | wyoming | 28990 | 2018.0 | lexus | es 350 sedan 4d | good | 6 cylinders | gas | 30112.0 | clean | other | fwd | NaN | sedan | wy | 33.786500 | -84.445400 | 2021-04-04T03:21:11-0600 |
| 426879 | wyoming | 30590 | 2019.0 | bmw | 4 series 430i gran coupe | good | NaN | gas | 22716.0 | clean | other | rwd | NaN | coupe | wy | 33.779214 | -84.411811 | 2021-04-04T03:21:07-0600 |
426843 rows × 18 columns
#Checking Data Once again
df_selected.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 426880 entries, 0 to 426879 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 region 426880 non-null object 1 price 426880 non-null int64 2 year 425675 non-null float64 3 manufacturer 409234 non-null object 4 model 421603 non-null object 5 condition 252776 non-null object 6 cylinders 249202 non-null object 7 fuel 423867 non-null object 8 odometer 422480 non-null float64 9 title_status 418638 non-null object 10 transmission 424324 non-null object 11 drive 296313 non-null object 12 size 120519 non-null object 13 type 334022 non-null object 14 state 426880 non-null object 15 lat 420331 non-null float64 16 long 420331 non-null float64 17 posting_date 426812 non-null object dtypes: float64(4), int64(1), object(13) memory usage: 58.6+ MB
df_selected.isna().sum()
region 0 price 0 year 1205 manufacturer 17646 model 5277 condition 174104 cylinders 177678 fuel 3013 odometer 4400 title_status 8242 transmission 2556 drive 130567 size 306361 type 92858 state 0 lat 6549 long 6549 posting_date 68 dtype: int64
df_selected['price']= df_selected[df_selected['price'] < 40000]['price']
missing_percentages = df_selected.isna().sum().sort_values(ascending = False) / len(df_selected)
missing_percentages[missing_percentages != 0]
missing_percentages[missing_percentages != 0].plot(kind='barh');
df_selected['size'] = df_selected['size'].fillna('unknown')
df_selected.rename(columns = {'size':'carsize'}, inplace = True)
df_selected['cylinders'].value_counts()
6 cylinders 94169 4 cylinders 77642 8 cylinders 72062 5 cylinders 1712 10 cylinders 1455 other 1298 3 cylinders 655 12 cylinders 209 Name: cylinders, dtype: int64
df_selected['cylinders'] = df_selected['cylinders'].fillna('12 cylinders')
df_selected['condition'].value_counts()
good 121456 excellent 101467 like new 21178 fair 6769 new 1305 salvage 601 Name: condition, dtype: int64
df_selected['condition'] = df_selected['condition'].fillna('good')
df_selected['drive'].value_counts()
4wd 131904 fwd 105517 rwd 58892 Name: drive, dtype: int64
df_selected['drive'] = df_selected['drive'].replace(['4wd'],'awd')
df_selected['drive'] = df_selected['drive'].fillna('rwd')
df_selected['type'] = df_selected['type'].fillna('hardtop')
df_selected['type'] = df_selected['type'].replace(['pickup'],'convertible')
df_selected['manufacturer'] = df_selected['manufacturer'].fillna('unknown')
df_selected.title_status.value_counts()
clean 405117 rebuilt 7219 salvage 3868 lien 1422 missing 814 parts only 198 Name: title_status, dtype: int64
df_selected['title_status'] = df_selected['title_status'].fillna('clean')
df_selected['lat'] = df_selected['lat'].fillna(33.669601)
df_selected['long'] = df_selected['long'].fillna(-86.817617)
df_selected['model'] = df_selected['model'].fillna('unknown')
df_selected['odometer'] = df_selected['odometer'].fillna(0)
df_selected['fuel'].value_counts()
gas 359222 other 30728 diesel 30062 hybrid 5170 electric 1698 Name: fuel, dtype: int64
df_selected['fuel'] = df_selected['fuel'].fillna('gas')
df_selected['transmission'] = df_selected['transmission'].fillna('other')
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import xticks
sns.set_style("darkgrid")
df_comp_avg_price = df_selected[['manufacturer','price']].groupby("manufacturer", as_index = False).mean().rename(columns={'price':'brand_avg_price'})
plt1 = df_comp_avg_price.plot(x = 'manufacturer', kind='bar',legend = False, sort_columns = True, figsize = (15,5))
plt1.set_xlabel("Brand")
plt1.set_ylabel("Avg Price ($)")
plt1.invert_xaxis()
xticks(rotation = 90)
plt.show()
C:\Users\offic\AppData\Local\Temp\ipykernel_10532\2369741136.py:2: FutureWarning: `sort_columns` is deprecated and will be removed in a future version.
df_selected['manufacturer'] = df_selected['manufacturer'].replace(['ram'],'ram trucks')
largetosmall = df_selected.groupby('manufacturer').size().sort_values().index[::-1]
print(largetosmall)
Index(['ford', 'chevrolet', 'toyota', 'honda', 'nissan', 'jeep', 'ram trucks',
'unknown', 'gmc', 'bmw', 'dodge', 'mercedes-benz', 'hyundai', 'subaru',
'volkswagen', 'kia', 'lexus', 'audi', 'cadillac', 'chrysler', 'acura',
'buick', 'mazda', 'infiniti', 'lincoln', 'volvo', 'mitsubishi', 'mini',
'pontiac', 'rover', 'jaguar', 'porsche', 'mercury', 'saturn',
'alfa-romeo', 'tesla', 'fiat', 'harley-davidson', 'ferrari', 'datsun',
'aston-martin', 'land rover', 'morgan'],
dtype='object', name='manufacturer')
fig, ax = plt.subplots(figsize = (15,5))
plt1 = sns.countplot(x= df_selected['manufacturer'], order = largetosmall)
plt1.set(xlabel = 'Brand', ylabel= 'Count of Cars')
xticks(rotation = 90)
plt.show()
plt.tight_layout()
<Figure size 640x480 with 0 Axes>
salesorder = df_selected.groupby('state').size().sort_values().index[::-1]
fig, ax = plt.subplots(figsize = (15,5))
plt1 = sns.countplot(x= df_selected['state'], order = salesorder)
plt1.set(xlabel = 'state', ylabel= 'Cars Sales')
xticks(rotation = 90)
plt.show()
plt.tight_layout()
<Figure size 640x480 with 0 Axes>
scat_df_fuel=df_selected.query('fuel=="diesel" or fuel=="electric" or fuel == "gas" or fuel == "hybrid"')
#filters out the column for any incorrect values
df_fuel_avg_price = scat_df_fuel[['fuel','price']].groupby("fuel", as_index = False).mean().rename(columns={'price':'fuel_avg_price'})
plt1 = df_fuel_avg_price.plot(x = 'fuel', kind='bar',legend = False, sort_columns = True)
plt1.set_xlabel("Fuel Type")
plt1.set_ylabel("Avg Price ($)")
xticks(rotation = 0)
plt.show()
C:\Users\offic\AppData\Local\Temp\ipykernel_10532\3971592624.py:5: FutureWarning: `sort_columns` is deprecated and will be removed in a future version.
scat_df_cartype=df_selected.query('type == "SUV" or type=="bus" or type=="convertible" or type=="coupe" or type=="hardtop" or type=="hatchback" or type=="mini-van" or type=="offroad" or type=="sedan" or type=="truck" or type=="van" or type=="wagon"')
df_body_avg_price = scat_df_cartype[['type','price']].groupby("type", as_index = False).mean().rename(columns={'price':'carbody_avg_price'})
plt1 = df_body_avg_price.plot(x = 'type', kind='bar',legend = False, sort_columns = True)
plt1.set_xlabel("Car Body")
plt1.set_ylabel("Avg Price ($)")
xticks(rotation = 90)
plt.show()
C:\Users\offic\AppData\Local\Temp\ipykernel_10532\2397231527.py:4: FutureWarning: `sort_columns` is deprecated and will be removed in a future version.
df_drivewheel_avg_price = df_selected[['drive','price']].groupby("drive", as_index = False).mean().rename(columns={'price':'drivewheel_avg_price'})
plt1 = df_drivewheel_avg_price.plot(x = 'drive', kind='bar', sort_columns = True,legend = False,)
plt1.set_xlabel("Drive Wheel Type")
plt1.set_ylabel("Avg Price ($)")
xticks(rotation = 0)
plt.show()
C:\Users\offic\AppData\Local\Temp\ipykernel_10532\2732342230.py:2: FutureWarning: `sort_columns` is deprecated and will be removed in a future version.
scat_df_carbody=df_selected.query('carsize=="compact" or carsize == "full-size" or carsize == "mid-size" or carsize == "sub-compact"')
df_carsize = scat_df_carbody[['carsize','price']].groupby("carsize", as_index = False).mean().rename(columns={'price':'dimens_avg_price'})
plt2 = df_carsize.plot(x = 'carsize', kind='bar', sort_columns = True,legend = False,)
plt2.set_xlabel("Car Size Type")
plt2.set_ylabel("Avg Price ($)")
xticks(rotation = 0)
plt.show()
C:\Users\offic\AppData\Local\Temp\ipykernel_10532\3044735781.py:4: FutureWarning: `sort_columns` is deprecated and will be removed in a future version.
df_cylindernumber=df_selected.query('cylinders == "10 cylinders" or cylinders == "12 cylinders" or cylinders == "3 cylinders" or cylinders == "4 cylinders" or cylinders == "5 cylinders" or cylinders == "6 cylinders" or cylinders == "8 cylinders"')
df_cylindernumber_avg_price = df_cylindernumber[['cylinders','price']].groupby("cylinders", as_index = False).mean().rename(columns={'price':'cylindernumber_avg_price'})
plt1 = df_cylindernumber_avg_price.plot(x = 'cylinders', kind='bar', sort_columns = True,legend = False)
plt1.set_xlabel("Cylinder Number")
plt1.set_ylabel("Avg Price ($)")
xticks(rotation = 90)
C:\Users\offic\AppData\Local\Temp\ipykernel_10532\2196906222.py:3: FutureWarning: `sort_columns` is deprecated and will be removed in a future version.
(array([0, 1, 2, 3, 4, 5, 6]), [Text(0, 0, '10 cylinders'), Text(1, 0, '12 cylinders'), Text(2, 0, '3 cylinders'), Text(3, 0, '4 cylinders'), Text(4, 0, '5 cylinders'), Text(5, 0, '6 cylinders'), Text(6, 0, '8 cylinders')])
df_selected['year']=df_selected[df_selected.year >= 1970]['year']
sns.distplot(df_selected['year'])
C:\Users\offic\AppData\Local\Temp\ipykernel_10532\3022398356.py:2: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751
<AxesSubplot: xlabel='year', ylabel='Density'>
df=df_selected.sort_values(by=['year'],ascending=False)
plt.figure(figsize=(25,10))
sns.barplot(x=df.year, y=df.price)
plt.xticks(rotation= 90)
plt.xlabel('Year')
plt.ylabel('Price($)')
plt.show()
scat_df_tesla=df_selected.query('manufacturer=="tesla"')
#filters out the tesla cars
scat_df_tesla=scat_df_tesla[scat_df_tesla.year>=1950]
scat_df_tesla= scat_df_tesla[scat_df_tesla.year<2023]
scat_df_tesla
| region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | drive | carsize | type | state | lat | long | posting_date | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1008 | birmingham | 29890.0 | 2013.0 | tesla | model s | good | 12 cylinders | electric | 61591.0 | clean | automatic | rwd | unknown | sedan | al | 33.669601 | -86.817617 | 2021-04-20T04:50:54-0500 | POINT (-86.81762 33.66960) |
| 1333 | birmingham | 32990.0 | 2013.0 | tesla | model s | good | 12 cylinders | electric | 69961.0 | clean | automatic | rwd | unknown | sedan | al | 33.669601 | -86.817617 | 2021-04-14T04:50:50-0500 | POINT (-86.81762 33.66960) |
| 1357 | birmingham | 38750.0 | 2018.0 | tesla | model 3 | good | 12 cylinders | electric | 14452.0 | clean | automatic | rwd | unknown | hardtop | al | 33.499500 | -86.726800 | 2021-04-13T09:37:20-0500 | POINT (-86.72680 33.49950) |
| 1806 | birmingham | 29890.0 | 2013.0 | tesla | model s | good | 12 cylinders | electric | 61591.0 | clean | automatic | rwd | unknown | sedan | al | 33.669601 | -86.817617 | 2021-04-04T10:01:09-0500 | POINT (-86.81762 33.66960) |
| 2755 | huntsville / decatur | NaN | 2015.0 | tesla | model s | good | 12 cylinders | electric | 23906.0 | clean | automatic | rwd | unknown | sedan | al | 36.138037 | -86.731163 | 2021-04-30T12:13:15-0500 | POINT (-86.73116 36.13804) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 424599 | milwaukee | NaN | 2020.0 | tesla | model 3 long range sedan | good | 12 cylinders | electric | 3996.0 | clean | other | rwd | unknown | sedan | wi | 43.060000 | -87.960000 | 2021-04-18T08:02:43-0500 | POINT (-87.96000 43.06000) |
| 425311 | sheboygan | 36590.0 | 2019.0 | tesla | model 3 standard range | good | 12 cylinders | electric | 23057.0 | clean | other | rwd | unknown | sedan | wi | 43.700000 | -87.740000 | 2021-04-30T11:51:04-0500 | POINT (-87.74000 43.70000) |
| 425611 | wausau | 37990.0 | 2019.0 | tesla | model 3 standard range | good | 12 cylinders | electric | 22544.0 | clean | other | rwd | unknown | sedan | wi | 44.960000 | -89.640000 | 2021-05-03T15:11:06-0500 | POINT (-89.64000 44.96000) |
| 425747 | wausau | 32990.0 | 2013.0 | tesla | model s signature | good | 12 cylinders | electric | 62588.0 | clean | other | rwd | unknown | sedan | wi | 44.960000 | -89.640000 | 2021-04-27T13:51:11-0500 | POINT (-89.64000 44.96000) |
| 425794 | wausau | 35990.0 | 2019.0 | tesla | model 3 standard range | good | 12 cylinders | gas | 31540.0 | clean | other | rwd | unknown | sedan | wi | 44.960000 | -89.640000 | 2021-04-26T12:21:09-0500 | POINT (-89.64000 44.96000) |
868 rows × 19 columns
plt.figure(figsize=(25,10))
sns.barplot(x=scat_df_tesla.year, y=scat_df_tesla.price)
plt.xticks(rotation= 90)
plt.xlabel('Year')
plt.ylabel('Price($)')
plt.show()
C:\Users\offic\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\algorithms.py:98: RuntimeWarning: Mean of empty slice C:\Users\offic\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\algorithms.py:98: RuntimeWarning: Mean of empty slice
scat_df_ford=df_selected.query('manufacturer == "ford"')
#filters out the year column for any incorrect values
scat_df_ford=scat_df_ford[scat_df_ford.year>=1990]
scat_df_ford= scat_df_ford[scat_df_ford.year<2023]
scat_df_ford
| region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | drive | carsize | type | state | lat | long | posting_date | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | auburn | 15000.0 | 2013.0 | ford | f-150 xlt | excellent | 6 cylinders | gas | 128000.0 | clean | automatic | rwd | full-size | truck | al | 32.592000 | -85.518900 | 2021-05-03T14:02:03-0500 | POINT (-85.51890 32.59200) |
| 44 | auburn | 30990.0 | 2019.0 | ford | ranger supercrew xl pickup | good | 12 cylinders | other | 1834.0 | clean | other | rwd | unknown | convertible | al | 32.590000 | -85.480000 | 2021-05-01T09:21:34-0500 | POINT (-85.48000 32.59000) |
| 47 | auburn | 34590.0 | 2018.0 | ford | f150 super cab xl pickup 4d | good | 6 cylinders | gas | 20856.0 | clean | other | rwd | unknown | convertible | al | 32.590000 | -85.480000 | 2021-04-30T12:51:31-0500 | POINT (-85.48000 32.59000) |
| 50 | auburn | 38990.0 | 2020.0 | ford | f150 supercrew cab xlt | good | 6 cylinders | gas | 12231.0 | clean | other | rwd | unknown | convertible | al | 32.590000 | -85.480000 | 2021-04-30T10:11:00-0500 | POINT (-85.48000 32.59000) |
| 53 | auburn | 27990.0 | 2020.0 | ford | ranger supercab xl pickup | good | 12 cylinders | gas | 10688.0 | clean | other | rwd | unknown | convertible | al | 32.590000 | -85.480000 | 2021-04-29T18:11:11-0500 | POINT (-85.48000 32.59000) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 426791 | wyoming | 32995.0 | 2013.0 | ford | f-350 super duty 4x2 2dr re | good | 12 cylinders | gas | 51000.0 | clean | automatic | rwd | unknown | hardtop | wy | 36.114900 | -115.216100 | 2021-04-07T10:05:03-0600 | POINT (-115.21610 36.11490) |
| 426792 | wyoming | 32999.0 | 2014.0 | ford | f350, xlt | excellent | 8 cylinders | diesel | 154642.0 | clean | automatic | awd | full-size | convertible | wy | 41.138284 | -104.784799 | 2021-04-07T09:06:54-0600 | POINT (-104.78480 41.13828) |
| 426813 | wyoming | 1000.0 | 1998.0 | ford | explorer | good | 6 cylinders | gas | 190000.0 | clean | automatic | rwd | unknown | hardtop | wy | 41.183600 | -104.802300 | 2021-04-06T15:40:45-0600 | POINT (-104.80230 41.18360) |
| 426817 | wyoming | 20590.0 | 2018.0 | ford | edge se sport utility 4d | good | 12 cylinders | gas | 37164.0 | clean | other | fwd | unknown | SUV | wy | 33.786500 | -84.445400 | 2021-04-06T03:50:18-0600 | POINT (-84.44540 33.78650) |
| 426844 | wyoming | 13977.0 | 2014.0 | ford | flex | good | 12 cylinders | gas | 112061.0 | clean | automatic | rwd | unknown | SUV | wy | 45.762900 | -108.539000 | 2021-04-05T10:42:43-0600 | POINT (-108.53900 45.76290) |
68486 rows × 19 columns
plt.figure(figsize=(25,10))
sns.barplot(x=scat_df_ford.year, y=scat_df_ford.price)
plt.xticks(rotation= 90)
plt.xlabel('Year')
plt.ylabel('Price($)')
plt.show()
columns_od = 'manufacturer == "ford" or manufacturer == "chevrolet" or manufacturer == "toyota" or manufacturer == "honda" or manufacturer == "nissan" or manufacturer == "jeep" or manufacturer == "ramtrucks" or manufacturer == "gmc" or manufacturer == "bmw" or manufacturer == "dodge" or manufacturer == "mercedes-benz" or manufacturer == "hyundai" or manufacturer == "subaru" or manufacturer == "volkswagen" or manufacturer == "kia" or manufacturer == "lexus" or manufacturer == "audi" or manufacturer == "cadillac" or manufacturer == "chrysler" or manufacturer == "acura" or manufacturer == "buick" or manufacturer == "mazda" or manufacturer == "infiniti" or manufacturer == "lincoln" or manufacturer == "volvo" or manufacturer == "mitsubishi" or manufacturer == "mini" or manufacturer == "pontiac" or manufacturer == "rover" or manufacturer == "jaguar" or manufacturer == "porsche" or manufacturer == "mercury" or manufacturer == "saturn" or manufacturer == "alfa-romeo" or manufacturer == "tesla" or manufacturer == "fiat" or manufacturer == "harley-davids" or manufacturer == "ferrari" or manufacturer == "datsun" or manufacturer == "aston-martin" or manufacturer == "landrover" or manufacturer == "morgan3"'
odometer_df = df_selected.query(columns_od)
plt.figure(figsize=(25,10))
sns.barplot(x=odometer_df.manufacturer, y=odometer_df.odometer)
plt.xticks(rotation= 90)
plt.xlabel('Manufacturer')
plt.ylabel('Odometer')
plt.show()
scat_df=df_selected.query('manufacturer=="ford" or manufacturer=="bmw" or manufacturer=="chevrolet" or manufacturer=="mercedes-benz" or manufacturer=="audi" or manufacturer=="jeep" or manufacturer=="toyota" or manufacturer=="tesla" or manufacturer=="volkswagen" or manufacturer=="lexus"')
#filters out the year column for any incorrect values
scat_df=scat_df[scat_df.year>=1950]
scat_df= scat_df[scat_df.year<2023]
#sets upper and lower limits for price column to eliminate incorrect entries
scat_df= scat_df[scat_df.price>10]
scat_df= scat_df[scat_df.price<40000]
#sets upper and lower limits for odometer column to eliminate incorrect entries
scat_df=scat_df[scat_df.odometer<=400000]
scat_df=scat_df[scat_df.odometer>=100]
#takes sample of a 20000 values from the obtained dataframe for the scatter plot
scat_sample=scat_df.sample(20000)
scat_plot1=px.scatter(scat_sample,x='price',y='odometer',color='manufacturer',hover_data=['year','model','condition'],template='plotly_dark',
title="Price-Distance Driven",color_discrete_sequence=px.colors.qualitative.Bold[3:7],
labels={
"price": "Price($)",
"odometer": "Distance Driven(Miles)",
"year":"Year",
"model":"Model",
"condition":"Condition",
"manufacturer":"Manufacturer"
})
scat_plot1.show()
#takes sample of a 3000 values from the obtained dataframe for the scatter plot
scat_sample2=scat_df.sample(3000)
scat_plot2=px.scatter(scat_sample2,x='price',y='odometer',color='manufacturer',hover_data=['year','model','condition'],template='plotly_dark',
title="Price-Distance Driven",color_discrete_sequence=px.colors.qualitative.Bold[3:7],
labels={
"price": "Price($)",
"odometer": "Distance Driven(Miles)",
"year":"Year",
"model":"Model",
"condition":"Condition",
"manufacturer":"Manufacturer"
})
scat_plot2.show()
plt.figure(figsize=(25,10))
scat_df_fuel=df_selected.query('manufacturer=="ford" or manufacturer=="bmw" or manufacturer=="chevrolet" or manufacturer=="mercedes-benz" or manufacturer=="audi" or manufacturer=="jeep" or manufacturer=="toyota" or manufacturer=="tesla" or manufacturer=="volkswagen" or manufacturer=="lexus"')
scat_df_fuel=scat_df_fuel[scat_df_fuel.year>=1950]
scat_df_fuel= scat_df_fuel[scat_df_fuel.year<2023]
#sets upper and lower limits for price column to eliminate incorrect entries
scat_df_fuel= scat_df_fuel[scat_df_fuel.price>10]
scat_df_fuel= scat_df_fuel[scat_df_fuel.price<40000]
#sets upper and lower limits for odometer column to eliminate incorrect entries
scat_df_fuel=scat_df_fuel[scat_df_fuel.odometer<=400000]
scat_df_fuel=scat_df_fuel[scat_df_fuel.odometer>=100]
sns.lineplot(x = "year", y = "odometer", data = scat_df_fuel, hue = "manufacturer")
plt.show()
plt.figure(figsize=(35,10))
scat_df_fuel2=df_selected.query('year >= 1970')
scat_df_fuel2=scat_df_fuel2[scat_df_fuel2.year>=1970]
scat_df_fuel2= scat_df_fuel2[scat_df_fuel2.year<2023]
#sets upper and lower limits for price column to eliminate incorrect entries
scat_df_fuel2= scat_df_fuel2[scat_df_fuel2.price>10]
scat_df_fuel2= scat_df_fuel2[scat_df_fuel2.price<40000]
#sets upper and lower limits for odometer column to eliminate incorrect entries
scat_df_fuel2=scat_df_fuel2[scat_df_fuel2.odometer<=400000]
scat_df_fuel2=scat_df_fuel2[scat_df_fuel2.odometer>=100]
sns.lineplot(x = "year", y = "odometer", data = scat_df_fuel2, hue = "manufacturer")
plt.show()
from shapely.geometry import Point
import geopandas as gpd
from geopandas import GeoDataFrame
geo_df = df_selected.query('state == "ks" or state == "wy" or state == "tn" or state == "nv" or state == "mt"')
geometry = [Point(xy) for xy in zip(geo_df['long'], geo_df['lat'])]
gdf = GeoDataFrame(geo_df, geometry=geometry)
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
#Dropping Anatartica as we are not using
world = world[(world.name != "Antarctica") & (world.name != "Fr. S. Antarctic Lands")]
gdf.plot(ax=world.plot(figsize=(10, 6),cmap='Greys',edgecolor='black'), facecolor='white', color='red', markersize=15);
plt.figure(figsize=(25,10))
sns.violinplot(x=df_selected.year, y=df_selected.condition)
plt.xticks(rotation= 90)
plt.xlabel('Year')
plt.ylabel('Condition')
plt.show()
auto = df_selected[['fuel', 'type', 'drive', 'carsize',
'cylinders','price', 'odometer']]
plt.figure(figsize=(18, 20))
plt.subplot(4,2,1)
scat_df_fuel=df_selected.query('fuel=="diesel" or fuel=="electric" or fuel == "gas" or fuel == "hybrid"')
sns.boxplot(x = 'fuel', y = 'price', data = scat_df_fuel)
plt.subplot(4,2,2)
sns.boxplot(x = 'cylinders', y = 'price', data = df_selected)
plt.subplot(4,2,3)
sns.boxplot(x = 'type', y = 'price', data = df_selected)
plt.subplot(4,2,4)
sns.boxplot(x = 'drive', y = 'price', data = df_selected)
plt.subplot(4,2,5)
sns.boxplot(x = 'carsize', y = 'price', data = df_selected)
plt.subplot(4,2,6)
sns.boxplot(x = 'condition', y = 'price', data = df_selected)
plt.tight_layout()
plt.show()
* Car price vs Brand
* Top Manufactures(Count of cars vs Brand)
* Top Sale Trends with state(Car sales vs States)
* Fuel Type vs Price
* car body vs price
* drive wheel vs price
* car dimensions(overall-size) vs price
* cylinders vs price
* odometer vs price
* Top 10 Most Driven Cars Considering Condition,Model,Price,Brand, Model Year, Distance
* Brand - Year -Condition
*numpy, pandas, pyspark, matplotlib, seaborn, folium, opendatasets, geopandas, ploty, xticks, pyplot, bucket
*bar, barh, countplot, seaborn barplot, scatter plot, violin plot,box plot,lineplot